Abstract

Mixture models are powerful statistical models used in many applications ranging from
density estimation to clustering and classification. When working with mixture models,
the experimenter must be aware of, and solve, several estimation issues. The
MixEst toolbox is a powerful and user-friendly package for MATLAB that implements
several state-of-the-art approaches to address these problems. Additionally, MixEst offers
manifold optimization for fitting the density model, a feature specific
to this toolbox. MixEst simplifies the use and integration of mixture models in statistical
models and applications. To develop mixture models of new densities, the user only
needs to provide a few functions for that statistical distribution, and the toolbox takes care
of all the issues regarding mixture models. MixEst is available at visionlab.ut.ac.ir/mixest,
is fully documented, and is licensed under GPL.

Keywords: mixture models, mixtures of experts, manifold optimization, expectation-maximization, stochastic optimization


1. Introduction 


Mixture models are an integral and fundamental component of many machine learning
problems ranging from clustering to regression and classification (McLachlan and Peel,
2000). Estimating the parameters of mixture models is challenging because the following
issues must be solved:


• Unboundedness of the likelihood: This problem occurs when one component is assigned only a
small number of data points and its likelihood becomes infinite (Ciuperca et al., 2003).

• Local maxima: The log-likelihood objective function for estimating the parameters of
mixture models is non-concave and has many local maxima (Ueda et al., 2000).

• Correct number of components: In many applications, the correct
number of components has to be found (Khalili and Chen, 2007).
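The unboundedness issue can be illustrated with a two-component univariate Gaussian mixture. The notation below (weights \pi_k, means \mu_k, variances \sigma_k^2, data x_1, ..., x_N) is introduced here only for illustration and is not taken from the toolbox.

```latex
\[
L(\theta) = \prod_{i=1}^{N} \Big[ \pi_1\, \mathcal{N}(x_i \mid \mu_1, \sigma_1^2)
          + \pi_2\, \mathcal{N}(x_i \mid \mu_2, \sigma_2^2) \Big].
\]
% Center the first component on a single datum, \mu_1 = x_1, and let \sigma_1 \to 0:
\[
\mathcal{N}(x_1 \mid \mu_1 = x_1, \sigma_1^2) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \to \infty .
\]
% Every other factor stays bounded away from zero through the second component,
\[
\pi_1\, \mathcal{N}(x_i \mid \mu_1, \sigma_1^2) + \pi_2\, \mathcal{N}(x_i \mid \mu_2, \sigma_2^2)
\;\ge\; \pi_2\, \mathcal{N}(x_i \mid \mu_2, \sigma_2^2) > 0, \qquad i \ge 2,
\]
% so L(\theta) \to \infty along this sequence of parameters.
```

The maximum of the likelihood is therefore not attained, which is why penalization or a similar safeguard is needed in practice.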


Addressing these issues for a mixture density that is not available in common mixture
modeling toolboxes costs the experimenter a lot of time and effort. MixEst addresses
all these issues not only for already implemented densities, but also for densities that the
user may implement. By implementing densities, we mean implementing a few simple
functions, which are briefly discussed in Section 3.

This toolbox provides a framework for applying manifold optimization to estimate the
parameters of mixture models. This is an important feature of this toolbox, because recent
empirical evidence shows that manifold optimization can surpass expectation maximization
in the case of mixtures of Gaussians (Hosseini and Sra, 2015). It also opens the door to
large-scale optimization by using stochastic optimization methods. Stochastic optimization
also allows solving the likelihood unboundedness problem mentioned above, without the
need to implement a penalizing function for the parameters of the density.

While several libraries are available for working with mixture models, to the best of our
knowledge, none of them offers a modular and flexible framework that allows fine-tuning
the model structure, or provides universal algorithms for estimating model parameters
that solve all the problems listed above. A review of the features available in some libraries
is given in Section 4.

In the next section, we give a short overview of the toolbox and its features. 


2. About the MixEst Toolbox 

This toolbox offers methods for constructing and estimating mixtures for joint density and
conditional density modeling. It is therefore applicable to a wide variety of tasks
such as clustering, regression and classification through a probabilistic model-based approach.
Each distribution in this toolbox is a structure containing a manifold structure, which represents
the parameter space of the distribution, along with several function handles implementing
density-specific functions like log-likelihood, sampling, etc. Distribution structures are
constructed by calling factory functions with appropriate input arguments defining the
distribution. For example, to construct a mixture of one-dimensional Gaussians with 2
components, it suffices to write the following commands in MATLAB:

Dmvn = mvnfactory(1);

Dmix = mixturefactory(Dmvn, 2); 


As an example of how to invoke a function handle, consider generating 1000 samples from
the previously defined mixture:

theta.D{1}.mu = 0; theta.D{1}.sigma = 1; % mean and variance of the 1st component 
theta.D{2}.mu = 5; theta.D{2}.sigma = 2; % mean and variance of the 2nd component 
theta.p = [0.8 0.2]; % weighting coefficients of components 
data = Dmix.sample(theta, 1000); 
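As a rough sanity check on the generated samples, the mean and variance of this two-component mixture follow from the standard mixture moment formulas. The symbols p_k, \mu_k and \sigma_k^2 below are shorthand for theta.p(k), theta.D{k}.mu and theta.D{k}.sigma, with sigma read as a variance as in the comments above.

```latex
\[
\mathbb{E}[x] = \sum_{k} p_k \mu_k = 0.8 \cdot 0 + 0.2 \cdot 5 = 1,
\]
\[
\mathrm{Var}[x] = \sum_{k} p_k \left( \sigma_k^2 + \mu_k^2 \right) - \mathbb{E}[x]^2
               = 0.8\,(1 + 0) + 0.2\,(2 + 25) - 1 = 5.2 .
\]
```

The sample mean and sample variance of the 1000 generated points should therefore be close to 1 and 5.2, up to sampling error.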


Each distribution structure exposes a common interface that optimization algorithms in the
toolbox can use to estimate its parameters. In addition to the EM algorithm, which is
commonly implemented in available libraries, our toolbox also provides optimization
on manifolds, with procedures such as early stopping and mini-batching to avoid
overfitting. For optimization on manifolds, our toolbox depends on the optimization procedures
of an excellent toolbox called Manopt (Boumal et al., 2014). In addition to the optimization
algorithms of Manopt, such as steepest descent, conjugate gradient and trust-region methods,
the user can also use our implementation of the Riemannian LBFGS method.
3. Model Development

MixEst includes many joint and conditional distributions to model data ranging from
continuous to discrete and also directional. Some users, however, may want to apply the tools
developed in this toolbox to mixtures of a distribution not yet available in the toolbox. To
this end, the user needs to write a factory function that constructs a structure for the new
distribution.

Each distribution structure has a field named “M” that determines the manifold of its
parameter space. For example, for the multivariate Gaussian distribution, this is a
product manifold of a positive definite manifold and a Euclidean manifold:

% datadim is the function input argument determining the dimensionality of data 
muM = euclideanfactory(datadim); 
sigmaM = spdfactory(datadim); 

D.M = productmanifold(struct('mu', muM, 'sigma', sigmaM));

The manifold of parameter space completely determines how parameter structure is given to 
or is returned by different functions. The structure of parameters for multivariate Gaussian 
would have two fields, a mean vector “mu” and a covariance matrix “sigma”. 

To use the estimation tools of the toolbox, two main functions have to be implemented:
the weighted log-likelihood (wll) function and a function for computing the gradient of
sum-wll with respect to the distribution parameters. The syntax for calling the wll function is:

llvec = D.llvec(theta, data); 

The input argument theta is a structure containing the input parameters of the
corresponding distribution. The second input argument data can be either a data matrix or
a structure having several fields such as the data matrix and weights, which is interpreted
using the mxe_readdata function. The output argument llvec is a vector with entries
equal to wll for each datum (each column) in the data matrix.
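In symbols, and assuming that each datum x_i carries a nonnegative weight w_i (taken to be 1 when no weights are supplied, which is an assumption of this sketch rather than something stated above), the two quantities can be written as:

```latex
\[
\mathrm{wll}_i(\theta) = w_i \log p(x_i \mid \theta), \qquad i = 1, \dots, N,
\]
\[
\text{sum-wll}(\theta) = \sum_{i=1}^{N} \mathrm{wll}_i(\theta)
                       = \sum_{i=1}^{N} w_i \log p(x_i \mid \theta).
\]
```

Under this reading, llvec corresponds to the vector (wll_1, ..., wll_N), one entry per column of the data matrix.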

The function to compute the gradient of sum-wll has the following syntax: 

llgrad = D.llgrad(theta, data); 

The input arguments are similar to the function llvec. The output argument llgrad is 
a structure similar to the input argument theta returning the gradient of sum-wll with 
respect to each parameter. 
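To make the two required functions concrete, the following is a minimal standalone sketch for a univariate Gaussian. It does not use any MixEst code: the handles llvec and llgrad are hypothetical illustrations, the weights are passed as a separate argument w instead of inside a data structure, and theta.sigma is treated as a variance. The script compares the analytic gradient of sum-wll with central finite differences.

```matlab
% Standalone sketch (hypothetical helpers, not MixEst code) for a univariate Gaussian.
% theta.mu is the mean and theta.sigma the variance; data is 1-by-N with one datum
% per column; w is a 1-by-N vector of weights (an assumed, simplified interface).
llvec = @(theta, data, w) w .* (-0.5 * log(2 * pi * theta.sigma) ...
    - (data - theta.mu).^2 ./ (2 * theta.sigma));

% Gradient of sum(llvec(...)), returned in a structure shaped like theta.
llgrad = @(theta, data, w) struct( ...
    'mu',    sum(w .* (data - theta.mu)) / theta.sigma, ...
    'sigma', sum(w .* ((data - theta.mu).^2 / (2 * theta.sigma^2) ...
                       - 1 / (2 * theta.sigma))));

% Small made-up example used only to check the gradient numerically.
theta = struct('mu', 0.3, 'sigma', 1.7);
data  = [-1.2 0.4 2.5 3.1];
w     = [1 0.5 2 1];
h     = 1e-6;
sumwll = @(t) sum(llvec(t, data, w));

g = llgrad(theta, data, w);

% Central finite difference with respect to mu.
tp = theta; tp.mu = theta.mu + h;
tm = theta; tm.mu = theta.mu - h;
fd_mu = (sumwll(tp) - sumwll(tm)) / (2 * h);

% Central finite difference with respect to sigma (the variance).
tp = theta; tp.sigma = theta.sigma + h;
tm = theta; tm.sigma = theta.sigma - h;
fd_sigma = (sumwll(tp) - sumwll(tm)) / (2 * h);

fprintf('mu:    analytic %.6f, finite difference %.6f\n', g.mu, fd_mu);
fprintf('sigma: analytic %.6f, finite difference %.6f\n', g.sigma, fd_sigma);
```

A finite-difference comparison of this kind is a cheap way to catch sign or scaling mistakes in a hand-derived gradient before handing it to an optimizer.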

Some other (optional) functions that can be implemented for distributions are:

init: This function initializes the estimator using the data.

estimatedefault: If the maximum wll has a structure that allows fast optimization
(or has a closed-form solution), this estimator can be implemented in this function.
When this function is not present, Riemannian optimization is called in the
maximization step of the EM algorithm.

llgraddata: This function computes the gradient of wll with respect to the data. 
It is required in some special cases such as when the distribution is used as the 
radial component of an elliptically-contoured distribution or as the components in 
independent component analysis. 
ll: This function is sum-wll (the sum of the output vector of the llvec function). Sometimes
it is faster to write this function differently than just calling llvec and summing up
its output vector.

Two other functions that can be used in the split-and-merge algorithms to avoid local
maxima of mixture models are kl (for computing the KL-divergence) and entropy (for
computing the entropy). If the user wants to compute a maximum-a-posteriori estimate, the functions
penalizerparam, penalizercost and penalizergrad need to be implemented.


4. Feature Comparison 


To demonstrate the richness of features in MixEst, we compare its features with those of
several other well-known packages in Table 1. Among the many toolboxes available for
mixture modeling, we select those that are feature-rich and representative. These packages
are Sklearn (Pedregosa et al., 2011), Mclust (Fraley and Raftery, 1999), FlexMix (Leisch,
2004), Bayes Net (Murphy, 2001) and MixMod (Biernacki et al., 2006). We include Bayes
Net to demonstrate what a generic Bayesian graphical modeling toolbox can do. Sklearn is
a powerful machine learning toolbox containing many tools, among them tools specific to
mixture modeling. MixMod also provides bindings for Scilab and Matlab.


Table 1: Feature comparison of our toolbox and some other well-known packages. Different
rows correspond to the following specifications of different toolboxes: 1. Programming
language; 2. Approaches for solving the local maxima problem (SM stands
for split-and-merge approach, IDMM for infinite Dirichlet mixture models, HC
stands for initialization using hierarchical clustering); 3. Manifold optimization;
4. Bayesian approaches for inference (MAP stands for maximum-a-posteriori,
VB stands for variational Bayes); 5. Large-scale optimization (SEM stands for
stochastic EM, MB stands for mini-batching); 6. Having tools for model selection;
7. Automatic model selection (CSM stands for competitive split-and-merge);
8. Ease of extensibility; 9. Having mixtures of experts; 10. Having mixtures of
classifiers; 11. Having mixtures of regressors.

 #  | MixEst | Sklearn | Mclust | FlexMix | Bayes Net | MixMod
----+--------+---------+--------+---------+-----------+-------
 1  | Matlab | Python  | R      | R       | Matlab    | C++
 2  | SM     | IDMM    | HC     |         |           |
 3  | Yes    | No      | No     | No      | No        | No
 4  | MAP    | VB      | MAP    |         | MAP       | SM
 5  | MB     |         |        |         |           | SEM
 6  | Yes    | No      | Yes    | No      | No        | Yes
 7  | CSM    | IDMM    |        |         |           |
 8  | Easy   |         |        | Easy    | Medium    |
 9  | Yes    | No      | No     | No      | Yes       | No
 10 | Yes    | No      | No     | No      | Yes       | No
 11 | Yes    | No      | No     | Yes     | Yes       | No